Abstract. We present a simple library which equips MPI implementations with
truly asynchronous non-blocking point-to-point operations, and which is
independent of the underlying communication infrastructure. It utilizes the MPI
profiling interface (PMPI) and the MPI_THREAD_MULTIPLE thread compatibility
level, and works with current versions of Intel MPI, Open MPI, MPICH2,
MVAPICH2, Cray MPI, and IBM MPI. We show performance comparisons on
a commodity InfiniBand cluster and two tier-1 systems in Germany, using
low-level and application benchmarks. Issues of thread/process placement and the
peculiarities of different MPI implementations are discussed in detail. We also
identify the MPI libraries that already support asynchronous operations. Finally
we show how our ideas can be extended to MPI-IO.

1 Introduction

A widespread misconception about MPI's non-blocking point-to-point and I/O routines
is that communication and I/O necessarily overlap with computation. According to
the MPI standard [11], non-blocking semantics do not require asynchronous progress.

Some MPI implementations do support asynchronous progress. However, this feature
often needs to be explicitly enabled at compile time (e.g., with "progress threads"),
requires the cooperation of several components of the software stack, or needs
special start-up parameters.

Surprisingly, many applications show performance improvements when non-blocking
point-to-point communication is employed, even when the MPI library does
not feature asynchronous progress. This is because of the other beneficial consequences
of using non-blocking MPI, such as full-duplex transfers and avoidance of frequent
explicit mutual synchronization.

In this work we describe a library which achieves asynchronous data transfer by 
utilizing the profiling interface of MPI (PMPI) and a separate progress thread. Therefore 
the MPI implementation must support the MPI-2 standard [12] and must provide the 
MPI_THREAD_MULTIPLE compatibility level to allow several threads of a process to 
perform concurrent calls to MPI routines. 

The library supports the C and Fortran MPI interfaces and requires no code changes
to the target application. If the target application is statically linked against the MPI
library, relinking may be required. The method should work independently of the
underlying interconnect.

This paper is organized as follows: In Sect. 2 we review related work on asyn- 
chronous data transfer with non-blocking point-to-point MPI or MPI I/O. The imple- 
mentation details of the library are discussed in Sect. 3, and the test bed (hardware and 
software used for benchmarking) is introduced in Sect. 4. In Sect. 5 we evaluate the 
point-to-point messaging capabilities of our library using low-level benchmarks and 
hybrid-parallel sparse matrix vector multiplication. Asynchronous I/O is discussed in 
Sect. 6. Sect. 7 gives a conclusion and an outlook. 

2 Related Work 

Overlapping data transfer can be achieved on three different levels [5]: either by manual
progression, progress threads, or communication offload.

Offloading. At the lowest level, the transfer can be offloaded to the corresponding
network interface, if supported. In principle this can be done, for example, with Myrinet
or InfiniBand if the host channel adapter (HCA) is capable of it. In [8] Koop et al.
describe a protocol which enables fully asynchronous progress by completely offloading
the message transfer and matching to the InfiniBand HCA.

Progress Threads. Another option for handling communication and/or I/O while the user
application proceeds is to use dedicated threads. One option is to have these threads
controlled by the MPI library. This technique was used for Open MPI until version
1.5.3 [18]; however, Open MPI could still support threads in some layers of its
architecture. Other implementations, such as MPICH2, MVAPICH2, and Cray MPI [2], also
feature special settings for enabling internal progress threads. Several less well-known
MPI implementations have also been built with this idea in mind, e.g., FiTMPI in Mao et
al. [10] or USFMPI in Caglar et al. [1]. In [5] Hoefler et al. analyze the impact of
progress threads for non-blocking collectives inside their own reduced MPI
implementation. Their findings are that polling for progress is beneficial if spare cores
are available, whereas interrupt-driven progress threads are advantageous if all cores
are fully utilized. Dickens et al. [3] use progress threads for collective MPI-IO. They
have found that naive usage of threads decreases performance. This may no longer be true
on current systems, as the number of available cores has increased. A similar approach has
been followed by Patrick et al. [13], who spawn a thread when the non-blocking I/O
functions are called. The thread then calls the blocking counterpart. Shahzad et al. [16]
use explicit progress threads inside an application for performing checkpoints of the
application. Here the performance of the application was only marginally reduced
compared to the case without any checkpointing. Using application-level progress threads,
Schubert et al. [15] could significantly improve the performance of hybrid-parallel
sparse matrix-vector multiplication.

Manual Progression. The idea of manual progression is to repeatedly call MPI_Test
to check for completion. This assumes that every call into the library drives
progress. How frequently these calls should be performed has been evaluated by
Hoefler et al. in [6].

Benchmarks. White and Bova [19] have performed an early investigation of asynchronous
progress in MPI libraries. In [9] Lawry et al. describe a benchmark for detecting
possible overlap. The Sandia MPI Micro-Benchmark Suite [14] contains a component
that measures the host processor overhead during non-blocking MPI send and
receive operations. The overhead introduced by using the MPI_THREAD_MULTIPLE
level was analyzed by Thakur et al. in [17]. Depending on the implementation quality
of the library, the overhead ranges from negligible to large.

Fig. 1. (a) The APSM library intercepts the initialization stage of the MPI library and
enforces the MPI_THREAD_MULTIPLE level. Finally the progress thread is started. (b) The
non-blocking MPI calls (e.g., MPI_Isend) are intercepted and executed by the application
thread. The returned request handle is put into an internal queue, which is consumed by
the progress thread. A generalized request handle is returned to the application. This
handle will receive the status of the original request once it has been completed.



3 Solution 

The APSM library (Asynchronous Progress Support for MPI) is designed to work with
every MPI library and any interconnect, as long as the following conditions are met:

- Every call into the MPI library drives the internal progress engine. 

- The MPI library supports MPI_THREAD_MULTIPLE. 

- The MPI library supports the MPI profiling interface (PMPI), i.e., for every
relevant MPI symbol (MPI_Xxx...) there is a corresponding symbol (PMPI_Xxx...).
A library can then implement the MPI functions it wants to intercept and call
the corresponding PMPI routine from within them.

The PMPI interface is used to intercept all MPI calls that are relevant for non-blocking 
point-to-point messages or non-blocking MPI-IO. 



3.1 Initialization and Finalization 

The initialization process is depicted in Fig. 1a. To set up the library, the MPI_Init*
functions are intercepted. In MPI_Init_thread the MPI_THREAD_MULTIPLE level is
requested; if the library does not support this level, the application is aborted. (For
convenience, calls to MPI_Init are rerouted to MPI_Init_thread so that applications
which do not call the threaded initialization require no code changes.) Next, the progress
thread is created using pthread_create. In our experience the Pthreads primitives
used by APSM do not interfere with any other threading model employed in the user
program, such as OpenMP.

The progress thread is terminated by intercepting MPI_Finalize, which first stops 
the progress thread before calling PMPI_Finalize. 

3.2 MPI Point-to-Point Functions 

All intercepted non-blocking point-to-point functions are handled in the same way (see
Fig. 1b). If such a function is called by the application, the actual requested operation,
e.g., MPI_Isend, is carried out by calling the corresponding PMPI function.

The returned request handle is enqueued in an internal queue, and a newly created
generalized request handle is returned to the application. This handle acts as a
"proxy" for the original request. This process is transparent to the application, and no
code changes are necessary.

The queued original requests, which are bound to the message transfers, are processed
from the internal queue by the progress thread. If, at any time, several requests are
waiting in the queue, they are served simultaneously by calling either
MPI_Test(some|any) or MPI_Wait(some|any) to drive the progress of the data
transfer and wait for completion. Which of the four alternatives is chosen in
practice depends on the MPI library (see below). If a request completes, its status is
propagated to the associated generalized request, notifying the application.
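
The interplay of the internal queue, the progress thread, and the proxy handle can be sketched in plain C with Pthreads. This is a minimal illustration under stated assumptions, not the APSM source: all type and function names are hypothetical, and mock requests stand in for MPI requests so that the sketch compiles without an MPI installation. In a real implementation the marked line would call a PMPI test/wait routine, and the proxy would presumably be completed through the MPI generalized request interface (MPI_Grequest_complete).

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 64

/* Hypothetical stand-in for the generalized request given to the application. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    bool            done;
    int             status;   /* copied from the original request on completion */
} proxy_t;

/* Hypothetical stand-in for the request returned by the PMPI call. */
typedef struct {
    int      result;          /* status the mock transfer finishes with */
    proxy_t *proxy;
} request_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    request_t       items[QUEUE_CAP];
    size_t          head, count;
    bool            shutdown;
} queue_t;

/* Called in the application thread right after the non-blocking call. */
static bool enqueue(queue_t *q, request_t r) {
    bool ok = false;
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAP) {
        q->items[(q->head + q->count) % QUEUE_CAP] = r;
        q->count++;
        ok = true;
        pthread_cond_signal(&q->nonempty);
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Progress thread: take requests off the queue and complete their proxies. */
static void *progress_main(void *arg) {
    queue_t *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0 && !q->shutdown)
            pthread_cond_wait(&q->nonempty, &q->lock);
        if (q->count == 0) {              /* shutdown requested, queue drained */
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        request_t r = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);

        /* A real progress thread would drive a PMPI test/wait call here. */
        pthread_mutex_lock(&r.proxy->lock);
        r.proxy->status = r.result;
        r.proxy->done = true;
        pthread_cond_signal(&r.proxy->done_cv);
        pthread_mutex_unlock(&r.proxy->lock);
    }
}

/* What waiting on the proxy handle amounts to for the application. */
static int proxy_wait(proxy_t *p) {
    pthread_mutex_lock(&p->lock);
    while (!p->done)
        pthread_cond_wait(&p->done_cv, &p->lock);
    int status = p->status;
    pthread_mutex_unlock(&p->lock);
    return status;
}

/* Post n mock requests, wait for all proxies, return how many completed
 * with the expected status (or -1 on invalid n or thread creation failure). */
int run_demo(int n) {
    if (n < 0 || n > QUEUE_CAP) return -1;
    queue_t q = { .head = 0, .count = 0, .shutdown = false };
    proxy_t proxies[QUEUE_CAP];
    pthread_t tid;
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.nonempty, NULL);
    if (pthread_create(&tid, NULL, progress_main, &q) != 0) return -1;
    for (int i = 0; i < n; i++) {
        pthread_mutex_init(&proxies[i].lock, NULL);
        pthread_cond_init(&proxies[i].done_cv, NULL);
        proxies[i].done = false;
        proxies[i].status = -1;
        enqueue(&q, (request_t){ .result = i, .proxy = &proxies[i] });
    }
    int completed = 0;
    for (int i = 0; i < n; i++)
        if (proxy_wait(&proxies[i]) == i) completed++;
    pthread_mutex_lock(&q.lock);
    q.shutdown = true;
    pthread_cond_signal(&q.nonempty);
    pthread_mutex_unlock(&q.lock);
    pthread_join(tid, NULL);
    return completed;
}
```

The application thread only enqueues and later waits on its proxy; all completion handling happens in the progress thread, which is the division of labor described above.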

Since all non-blocking MPI functions return an MPI request handle, this method
works for all of them. Note that the call to the PMPI functions still happens in the
context of the application's thread. This is necessary to provide a correct MPI program,
since a pair of matching non-blocking send and receive calls in the same MPI process is
guaranteed to complete. This would not be ensured if the PMPI calls were made inside
the progress thread and MPI_Wait(some|any) were used to wait for completion: After
the application posts the send, the progress thread would detect it in the internal queue,
execute it, and wait for it. Then, a matching receive could be posted next by the
application, but it would never be handled by the progress thread, which would wait
forever for the completion of the send request.

3.3 MPI-IO Functions 

Calls to non-blocking MPI-IO functions are handled slightly differently. The call to the 
PMPI function (e.g. PMPI_File_iwrite) is performed in the context of the progress 
thread. Since the MPI standard allows MPI-IO progress to occur within the initial non- 
blocking call, this is the more general (and, in this case, safe) way to ensure asyn- 
chronous I/O. 

3.4 Fortran Interface 

The Fortran interface poses a slight problem, since different MPI implementations use
different strategies to implement the MPI interface. Internal MPI routines may be called
directly from the Fortran interface, or the Fortran interface may be stacked on top of the
C interface, so that Fortran MPI calls are just rerouted. The library is built to cope with
both situations. Moreover, it detects the symbol convention for the Fortran routines
automatically: As there is no standard for how Fortran compilers name their symbols, the
C MPI_Isend function could be called, e.g., mpi_isend, mpi_isend_, mpi_isend__,
or MPI_ISEND, to name only the most common ones.

Table 1. Test bed for evaluation. For measuring the attainable interconnect bandwidth the
ping-pong part of the Intel MPI benchmarks was used.
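
To make the Fortran naming issue from Sect. 3.4 concrete, the following C sketch generates candidate Fortran symbol names for a given C routine name. It is an illustration only: the function and enum names are hypothetical, and the set of variants (lowercase with zero, one, or two trailing underscores, and uppercase) is an assumption based on common compiler conventions. How APSM performs the automatic detection is not shown here; the sketch covers only the name generation.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical labels for common Fortran symbol naming conventions. */
enum fortran_mangling { F_LOWER, F_LOWER_US, F_LOWER_2US, F_UPPER };

/* Write the Fortran symbol for a C routine name such as "MPI_Isend" into out.
 * Returns 0 on success, -1 if out is too small. */
int fortran_symbol(const char *c_name, enum fortran_mangling m,
                   char *out, size_t out_size) {
    static const char *suffix[] = { "", "_", "__", "" };
    size_t n = 0;
    for (const char *p = c_name; *p != '\0'; p++) {
        if (n + 1 >= out_size) return -1;
        unsigned char ch = (unsigned char)*p;
        out[n++] = (char)(m == F_UPPER ? toupper(ch) : tolower(ch));
    }
    for (const char *s = suffix[m]; *s != '\0'; s++) {
        if (n + 1 >= out_size) return -1;
        out[n++] = *s;
    }
    if (out_size == 0) return -1;
    out[n] = '\0';
    return 0;
}
```

An interposition library could export a wrapper under each generated name, or probe which of them the MPI library actually provides; either choice is left open here.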

3.5 Affinity of the Progress Thread 

Since the application and the MPI library are not aware of the progress thread, they
cannot control its affinity (i.e., which logical core it is bound to) in a meaningful manner.
If there is no general way to handle excess threads (e.g., with the -d option in Cray's
mpirun), the affinity can be set via the environment variable MPI_ASYNC_CPU_LIST. The
specified core list relates to the MPI processes on a node. For example, the list 0_2_4 would
pin the progress thread of the first MPI process on every node to core 0, the progress
thread of the second MPI process to core 2, and the progress thread of the third MPI
process to core 4 (see footnote 3).
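
The mapping from an underscore-separated core list to the core of a given process can be sketched as follows. The list format is taken from the example above; the function name and the notion of a zero-based node-local rank passed in by the caller are assumptions for illustration.

```c
#include <stdlib.h>

/* Return the core listed for the given zero-based node-local rank in a list
 * such as "0_2_4", or -1 if no valid entry exists for that rank. */
int core_for_local_rank(const char *list, int local_rank) {
    if (list == NULL || local_rank < 0)
        return -1;
    const char *p = list;
    for (int idx = 0; ; idx++) {
        char *end;
        long core = strtol(p, &end, 10);
        if (end == p || core < 0)
            return -1;              /* no core id where one was expected */
        if (idx == local_rank)
            return (int)core;
        if (*end != '_')
            return -1;              /* list ended before reaching local_rank */
        p = end + 1;
    }
}
```

In practice the string would come from getenv("MPI_ASYNC_CPU_LIST"), and the returned core would be handed to a platform-specific affinity call; both steps are omitted here.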

4 Test Bed 

We have used three cluster systems for our tests: "Lima" at the Erlangen Regional
Computing Center (RRZE) in Erlangen, Germany, "SuperMUC" at the Leibniz
Supercomputing Center (LRZ) in Garching, Germany, and "Hermit" at the High
Performance Computing Center (HLRS) in Stuttgart, Germany. Their system parameters can
be found in Table 1.



Footnote 3: There is the residual problem of how to determine the number of processes
per node. This can be solved in a general way, but an in-depth description would be out
of scope for this work.



5 Overlap of Point-to-Point Messages 



In this section we demonstrate the capabilities of the APSM library using simple 
low-level benchmarks and a hybrid-parallel sparse matrix-vector multiplication kernel. 
Since it is impossible to show all benchmark results due to the vast number of param- 
eters, we will concentrate the discussion on the most prominent aspects. The complete 
set of benchmark results can be reviewed online [20]. 

5.1 Simple Overlap Benchmark 

A simple benchmark is used to test the ability of MPI libraries to overlap computation
with communication using non-blocking point-to-point calls [4]. Two MPI processes
are run, each on its own compute node. The first process initiates a non-blocking
send (MPI_Isend), performs (CPU-bound) work for a time $t_w$, and finally
calls MPI_Wait. The total time $t_t$ taken for all three steps is measured. The second
process immediately posts a blocking receive (MPI_Recv). The benchmark can also be run
with MPI_Irecv/MPI_Send or MPI_Isend/MPI_Irecv instead of MPI_Isend/MPI_Recv.

When $t_t$ is plotted against $t_w$ and no asynchronous progress has occurred, a straight
line can be seen with an offset on the y axis:

$$t_t = t_c + t_w, \qquad (1)$$

where the communication time $t_c = V/B_n + t_l$. Here $V$ is the message volume, $B_n$ is
the network bandwidth, and $t_l$ is the network latency. If there is fully asynchronous
progress, we have

$$t_t = \max(t_c, t_w). \qquad (2)$$
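
As a worked illustration of Eqs. (1) and (2), the following C sketch evaluates both limits for a given message volume, bandwidth, latency, and work time. The function names are hypothetical, and the code is a model of the two idealized cases, not of any measured behavior.

```c
/* Communication time t_c = V / B_n + t_l
 * (volume in bytes, bandwidth in bytes/s, latency in seconds). */
double comm_time(double volume, double bandwidth, double latency) {
    return volume / bandwidth + latency;
}

/* Eq. (1): no asynchronous progress, transfer and work are serialized. */
double total_time_no_overlap(double t_c, double t_w) {
    return t_c + t_w;
}

/* Eq. (2): fully asynchronous progress, the longer phase hides the shorter. */
double total_time_full_overlap(double t_c, double t_w) {
    return t_c > t_w ? t_c : t_w;
}
```

Comparing a measured total time against these two values for a range of work times indicates how close a given MPI library comes to full overlap.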

The results for Intel MPI 4.0.3 are shown in Fig. 2a. Intel MPI currently does not 
provide asynchronous progress over InfiniBand. Using it together with APSM pro- 
vides full overlap for non-blocking point-to-point communication and "large" messages 
(see below for details on how smaller messages must be handled). The only test/wait 
function of MPI which was usable in the progress thread without deadlocking was 
MPI_Waitany. We attribute this to problems with thread safety in Intel MPI. 

Open MPI (version 1.6.3) provides overlap for non-blocking point-to-point commu- 
nication, at least for MPI_Isend. However, Open MPI cannot use InfiniBand with the 
MPI_THREAD_MULTIPLE threading level as the corresponding OpenIB module is not 
thread safe. In this case the implementation falls back to TCP, which in our case takes 
place using IP-over-IB or Gigabit Ethernet. This can be seen from the simple ping-pong 
benchmark in Fig. 2b. Here only around 200 MB/s compared to the 3.0 GB/s with IB 
can be achieved when the highest threading level is requested. 

MPICH2 only supports Gigabit Ethernet (GE) and no InfiniBand. With the internal
progress thread enabled (by setting MPICH_ASYNC_PROGRESS=1) overlap can be
achieved. APSM can be used as an alternative.

MVAPICH2 (version 1.9a2) can overlap non-blocking point-to-point messages
with computation if the internal progress threads are enabled via
MPICH_ASYNC_PROGRESS=1. This MPI library also works with APSM.




Fig. 2. (a) Overlap benchmark with Intel MPI 4.0.3 on Lima for MPI_Isend/MPI_Recv pair
over IB with (dashed line) and without (solid line) APSM. The chosen message size is a
parameter. (b) PingPong benchmark with Open MPI 1.6.3 on Lima. Compiling Open MPI with
thread safety support introduces a small overhead (circles vs. squares). When the thread
level MPI_THREAD_MULTIPLE is additionally requested, the InfiniBand transfer module fails
to load as it is not thread safe, and Open MPI falls back to TCP/IP (triangles).



IBM MPI (version 1.2) can by default only overlap MPI_Isend with computation.
However, specifying MP_CSS_INTERRUPT=yes [7], which causes arriving packets
to generate interrupts, leads to overlapping behavior in all other situations as well.
Utilizing APSM delivers in principle the same result, but introduces a lot of variability
in execution times; sporadically, MPI calls take an exceedingly long time. The reason for
this behavior has not been investigated yet.

Cray MPI in the standard configuration provides no asynchronous message transfer,
but it supports an option to activate an extra progress thread by setting the
environment variable MPICH_NEMESIS_ASYNC_PROGRESS=1 [2] and reserving one core with
the aprun option -r 1 for the additional thread. With the simple overlap benchmark
this yields better results than with APSM.

Table 2 summarizes these results in columns 3 and 5. 



5.2 Prototype Ghost Cell Exchange Benchmark 

This benchmark simulates strong scaling of an application which performs exchange of 
ghost cells in one dimension. A number of MPI processes are running, each of which ex- 
changes a "halo" of fixed size with its two neighbors. After the exchange, each process 
executes a workload which is subject to strong scaling with the number of processes. 
Computations which would be required in a real application for the boundary cells in 
preparation of the halo exchange are neglected. 

In order to better mimic the execution behavior of real applications but still achieve
good reproducibility of time measurements, a simple triad loop benchmark
(a(:) = b(:) * c(:) + d(:)) was chosen as the workload. The size of the working set was
adjusted to fit into each core's own L2 cache.
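
For reference, the triad workload can be written in C as shown below. The kernel follows the array expression given above; the helper that derives an array length from a cache size is a hypothetical addition, assuming double-precision data and four arrays of equal length.

```c
#include <stddef.h>

/* Triad kernel a(:) = b(:) * c(:) + d(:) over n elements. */
void triad(double *a, const double *b, const double *c, const double *d,
           size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i] + d[i];
}

/* Largest per-array length such that all four arrays fit into cache_bytes. */
size_t triad_length_for_cache(size_t cache_bytes) {
    return cache_bytes / (4 * sizeof(double));
}
```

Keeping the working set cache-resident removes memory bandwidth as a source of timing noise, which is the reproducibility argument made above.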
Table 2. Overview of all MPI implementations and the system they were evaluated on.
a Only MPI_Waitany can be called inside the progress thread
b I/O deadlocks
c Only MPI_Isend overlaps
d No IB with APSM
e With progress thread enabled
f With interrupts enabled
g Strong variability in measurement



We used the Lima cluster with Intel MPI and up to twelve nodes, with twelve MPI 
processes per node (PPN). Each process was bound to its own physical core. In the case 
where the APSM library was used, the progress thread was pinned to the other SMT 
thread. 

The performance results for a communication buffer size of 10 MiB can be found in
Fig. 3a. Use of the APSM library achieves superior performance and scalability up to
the point where communication takes longer than computation (at about three nodes).
This can be seen from Fig. 3b, which shows a breakdown of the runtime into the
duration of work, i.e., the computation (filled symbols), and the "visible" communication
time, which in the case of overlap is the difference between overall time and working
time (open squares). Without overlap, the communication time is independent of the
number of processes (open circles), since the message length is always the same.

Beyond three nodes, performance saturates in the overlapped case, whereas it con- 
tinues to rise without overlap. At large node counts, both numbers coincide, since com- 
munication absolutely dominates in this case and computation time is negligible. How- 
ever, this is not the limit in which one would want to run any real application code in 
practice, since parallel efficiency has dropped to unacceptable levels. The "sweet spot" 
in terms of efficient execution is at the point where the overlapped performance sat- 
urates. This is also where the advantage compared to the non-overlapped case is at a 
maximum. 

As the difference in $t_w$ between overlapped and non-overlapped cases shows, the
progress thread requires additional resources, reducing the worker thread's performance
accordingly. In cases where spare physical cores are available, e.g., if the application
is strongly memory-bound with saturation across the cores of a socket, the progress
threads can be bound to those. This would strongly reduce the interference of MPI
progress with application execution.

5.3 Sparse Matrix Vector Multiplication (spMVM) 

We use a hybrid (OpenMP + MPI) sparse matrix vector multiplication ($y = y + M x$)
as a relevant real-world test case to demonstrate the applicability of our approach. MPI
parallelization is done by distributing the matrix rows across processes so that each
process has (approximately) the same number of nonzero entries. The right-hand side
(RHS) vector $x$ is distributed in the same way. Consequently, overlapping computation
with the required communication of the RHS vector parts requires splitting the spMVM
operation into two phases: a "local" phase, in which a process multiplies its local part of
the RHS vector with the corresponding diagonal block of the matrix, and a "non-local"
phase, in which the parts of the RHS vector that have been received from other processes
are multiplied with the remaining matrix entries.

Fig. 3. Ghost cell benchmark on Lima with Intel MPI for one to twelve nodes and twelve MPI
processes per node. (a) Performance with plain Intel MPI (circles) and with APSM (squares).
(b) Breakdown of computation ($t_w$) and (visible) communication times.
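
The row distribution for the spMVM, in which every process receives approximately the same number of nonzeros, can be sketched as a split of the nonzero prefix sums. The text does not specify the partitioning algorithm or the storage format, so both the CSR-style row pointer array and the simple greedy split below are assumptions, and the function name is hypothetical.

```c
/* Split nrows matrix rows among nparts processes so that each part holds
 * roughly the same number of nonzeros. row_ptr has nrows + 1 entries, with
 * row_ptr[r] the number of nonzeros in rows 0..r-1 (CSR row pointer).
 * On return, part p owns rows first_row[p] .. first_row[p + 1] - 1;
 * first_row must have room for nparts + 1 entries. */
void partition_rows_by_nnz(const long *row_ptr, int nrows, int nparts,
                           int *first_row) {
    long total = row_ptr[nrows];
    int r = 0;
    for (int p = 0; p < nparts; p++) {
        long target = (total * p) / nparts;  /* nonzeros before part p */
        while (r < nrows && row_ptr[r] < target)
            r++;
        first_row[p] = r;
    }
    first_row[nparts] = nrows;
}
```

For a matrix with rows of 4, 1, 1, and 2 nonzeros and two processes, this assigns the first row to process 0 and the remaining three rows to process 1, so that both hold four nonzeros.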

There are two ways in which communication overlap may be achieved: "vector
mode with naive overlap" and "task mode with explicit overlap" [15]. The former uses
all OpenMP threads to perform the local spMVM part and relies on non-blocking MPI
calls and a subsequent MPI_Waitall for asynchronous MPI progress, while the latter
uses a dedicated communication thread. After the local spMVM and the communication
are both over, both approaches perform the non-local spMVM with all threads (see [15]
for a full description).

If the MPI implementation does not support asynchronous progress, communication 
only takes place during the MPI_Waitall call. This results in overlap only with task 
mode, at the price of sacrificing one worker thread for the local spMVM. 

For evaluation we selected two sparse matrices, "HV15R" and "DLR1," where the
former has about $2 \cdot 10^6$ rows and $2.8 \cdot 10^8$ nonzeros, and the latter has
$2.8 \cdot 10^5$ rows and $4 \cdot 10^7$ nonzeros. DLR1 uses only around 480 MB of
memory and thus fits completely in the L3 caches of 24 Lima compute nodes.

Task mode shows best performance with Intel MPI in all cases (see Fig. 4). For 
the HV15R matrix (Fig. 4a) vector mode with APSM is better than without and nearly 
achieves the performance of task mode. The DLR1 matrix reveals a specific problem 
due to the very small message sizes that occur as the number of MPI processes is in- 
creased. These are usually handled by MPI implementations using a so-called eager 
protocol: If a message is small enough it can directly be sent to a predefined buffer at 
the destination. Otherwise the sender and the receiver must synchronize to initiate the 
actual transfer (rendezvous protocol).

Fig. 4. Performance comparison of spMVM with Intel MPI on Lima, running one MPI process
with twelve OpenMP threads per node. (a) HV15R matrix. (b) DLR1 matrix.

If every message were handled in the same way by the APSM library independent
of size, the latency for potential eager messages would increase, since the whole
mechanism of processing queued requests by the progress thread would add nothing but
overhead. When the library is made aware of the eager threshold, request handles for eager
messages are directly obtained from MPI and passed back to the application, with no
interference from the progress thread. The difference between both behaviors can be
seen in Fig. 4b, where the performance of APSM without eager awareness becomes 
unacceptable beyond 16 nodes (diamonds). With eager awareness, however (at a mes- 
sage size of < 256 KiB in this case), the performance reaches the level of vector mode 
without APSM (triangles). The reason why task mode is still measurably better at large 
node counts is that all overheads connected with MPI communication can be hidden by 
an explicit communication thread, whereas the eager protocol alone (which is also in 
effect when APSM is used with eager awareness) still suffers from unavoidable com- 
munication latencies. 
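
The eager-awareness decision described in this section reduces to a size comparison. The sketch below assumes that the eager threshold of the MPI library is known to the caller in bytes; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if a message of msg_bytes should be tracked by the progress thread;
 * false if it is below the eager threshold, in which case its request handle
 * is passed back to the application unchanged. */
bool route_via_progress_thread(size_t msg_bytes, size_t eager_threshold_bytes) {
    return msg_bytes >= eager_threshold_bytes;
}
```

With a threshold of 256 KiB, as in the measurements above, a 1 KiB message bypasses the progress thread while a 10 MiB message is queued for it.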

6 Overlap of MPI-IO 

The APSM library can also be used to overlap computation and MPI-IO. To evaluate 
the state of the MPI implementations and the usefulness of APSM a modified version of 
the overlap benchmark (see Sect. 5.1) was used, where point-to-point communication 
was substituted by I/O via MPI_File_iwrite. Only one process per node was used, 
writing 6 GiB of data to a parallel file system. All processes wrote to the same file. Care 
was taken to rule out caching effects, i.e., the measured I/O times included real disk I/O 
only. In general, getting reliable timing for I/O is not easy since the parallel filesystems 
are usually under load by other users, which leads to fluctuating bandwidths. 

The results can be summarized as follows. The Intel MPI library does not overlap
computation and I/O. Using it together with APSM is not possible due to frequent
deadlocks. Open MPI does not provide asynchronous I/O on the Lustre filesystem of Lima,
but overlap can be seen with APSM (see Fig. 5a). The main disadvantage of this
solution is that Open MPI cannot use native InfiniBand for point-to-point communication
with APSM (see Sect. 5.1 above). MVAPICH2 does not support asynchronous I/O even
with its internal progress thread enabled. However, overlap is available with APSM (see
Fig. 5b). Cray MPI does not feature asynchronous I/O (with or without the internal
progress thread) but can benefit from the APSM library.

Fig. 5. Results for the MPI-IO overlap benchmark on Lima using (a) Open MPI and (b)
MVAPICH2, on eight Lima nodes, one process per node, and 6 GiB of I/O volume per process.

Table 2 gives an overview of the results in columns 4 and 7. 

7 Summary 

We have demonstrated how asynchronous point-to-point communication and MPI-IO 
can be achieved for MPI implementations that have no native support for asynchronous 
progress. By overloading some MPI functions using the PMPI interface, an internal 
progress thread is used to handle non-blocking requests in the background with minimal 
impact on the performance of code execution in the application program. In cases where 
no dedicated physical cores are available for the progress thread, virtual cores can be 
used. Most current MPI implementations are compatible with our method. However,
strict MPI_THREAD_MULTIPLE compatibility and thread safety are required.

Possible future work includes support for persistent communication and split- 
collective MPI-IO functions. The library is freely available under an LGPL license [20]. 